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Abstract 

The combination of a non-overlapping Schwarz preconditioner and the Hybrid Monte Carlo 
(HMC) algorithm is shown to yield an efficient simulation algorithm for two-flavour lattice 
QCD with Wilson quarks. Extensive tests are performed, on lattices of size up to 32 x 24^, 
with lattice spacings a ~ 0.08 fm and at bare current-quark masses as low as 21 MeV. 



1. Introduction 

At present, perhaps the greatest obstacle in lattice QCD is the fact that the efficiency 
of the estabhshed simulation algorithms rapidly decreases when the continuum limit 
is approached and the masses of the light quarks are scaled towards their physical 
values The dynamics of these algorithms is still not fully understood, but it is 

quite clear that the poor scaling behaviour is driven by the condition number of the 
lattice Dirac operator, which grows inversely proportionally to the lattice spacing 
and the quark mass. 

Preconditioning is usually perceived as a technique for the efficient solution of ill- 
conditioned systems of linear equations [12]. This kind of preconditioning is routinely 
applied in lattice QCD to accelerate the solver for the lattice Dirac equation. While 
the solver is a central element of the HMC simulation algorithm [13], it is also 
possible to precondition this algorithm itself, using another preconditioner perhaps, 
by factorizing the quark determinant into the determinants of the preconditioners 
and the preconditioned Dirac operator. The magnitude of the quark force terms in 
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the molecular dynamics equations is then often reduced, which allows the associated 
integration step sizes to be set to larger values and thus leads to an acceleration of 
the algorithm [14-18,10]. 

In the present paper the effectiveness of a non-overlapping Schwarz preconditio- 
ner for the HMC algorithm is studied. The application of the Schwarz alternating 
procedure in lattice QCD has previously been advocated in refs. [19,20], and some 
familiarity with the second of these papers will be assumed here. In order to bring out 
the underlying strategies more clearly, the general structure of the preconditioned 
HMC algorithm is first discussed. The Schwarz-preconditioned algorithm is then 
introduced in sect. 3 and the results of some test runs are reported in sect. 4. 



2. Preconditioned HMC algorithm 

Only the standard Wilson formulation [21] of lattice QCD will be considered in this 
paper, with bare coupling and bare quark mass mo, but the algorithm is set up in 
such a way that 0(a) improvement [22,23] and more complicated gauge actions with 
double-plaquctte terms [24,25] can be easily included. As usual a four-dimensional 
hypercubic lattice of size T x L'^ will be assumed, with lattice spacing a set to unity 
for convenience, and periodic boundary conditions in all directions, except for the 
quark fields, which are taken to be antiperiodic in time. Any unexplained notations 
are as in ref. [20]. 

2.1 Factorization of the quark determinant 

In the familiar case of even-odd preconditioning, the lattice points are ordered such 
that the even ones come first and the (massive) Wilson-Dirac operator D = D^+mo 
then assumes the block form 



in position space. Whenever the Dirac operator is written in this way, its determi- 
nant may be factorized according to 



where the operator in the curly bracket is referred to as the preconditioned Dirac 




(2.1) 



detD = detail det^22 det (l - ^1/^12^22^ ^21} , 



(2.2) 
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operator or (in the mathematical hterature) as the Schur complement of the Dirac 
operator with respect to the block decomposition (2.1). 

In general, preconditioning is always associated with a factorization 

det D = det i?i . . . det (2.3) 

of the quark determinant into the determinants of certain operators Rk- How many 
factors one obtains, and in exactly which spaces of fields the operators act, depends 
on the chosen preconditioner. It is also possible to combine several preconditioners, 

which may result in further factorizations. Just to mention an example, the simple 
polynomial preconditioning of the even-odd preconditioned Wilson-Dirac operator 
D, which was recently considered in rcfs. [16-18], leads to 

Ri = t) + M, R2 = D{D + M)-^, (2.4) 

where M is an adjustable mass parameter. 

2.2 Preconditioned molecular dynamics 

For any given factorization of the quark determinant, the HMC algorithm for two- 
flavour QCD can be set up in the standard manner. First the link momenta n(x,/x) 
and the appropriate pseudo-fermion fields (x) , . . . , ^„ (x) (one for each factor of 
the determinant) need to be introduced. These fields are elements of certain linear 
spaces that are assumed to be equipped with the obvious scalar products. The HMC 
hamiltonian may then be written in the compact form 

n 

H=1{U,U) + Sg + Y1 {Rk'^k,Rk'(Pk) , (2.5) 
fe=l 

where Sq denotes the action of the gauge field. 

Starting from this expression, the forces in the molecular dynamics equations 

— H(x,m) = -^Ffe(x,M), (2.6) 

fc=0 

^U{x,tJ,)=U{x,ti)U{x,tJ,), (2.7) 

are obtained by differentiation with respect to the link variables. They take values 
in the Lie algebra of SU(3) and are such that 

{u;,Fo) = S^Sg, (2.8) 
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(a;,Ffc) = 2Re(i?-Vfc,<5a,i?fcVfc) , k = l,...,n, (2.9) 
for all infinitesimal variations 

S^U{x, n) = u;{x, tJ,)U{x, n) (2.10) 
of the gauge field. 

Among the operators R/. there is normally one, say R^, which is the preconditioned 
Wilson-Dirac operator or a closely related operator. In the course of the numerical 
integration of the molecular dynamics equations, the linear equation 

Rni^ = r] (2.11) 

must be solved many times and this part of the calculation is then likely to consume 
most of the computer time, particularly at small quark masses. An important point 
to note here is that the solution of eq. (2.11) can be obtained by solving the equivalent 
Wilson-Dirac equation, using a preconditioner and a solver that are optimal for this 
task. In the case of the factorization (2.2), for example, there are n = 3 operators, 

Rl=Au, R2=A22, i23 = 1- A"l'^12^22^21, (2.12) 

and eq. (2.11) is equivalent to 

n.-{r). x^(t). (.13, 

The preconditioner used for the solution of this equation and the one that determines 
the factorization of the quark determinant thus do not need to be the same. These 
two kinds of preconditioning should be kept separate and must actually satisfy quite 
different criteria, as will become clear below. 

2.3 Multiple step size integration 

The widely used numerical integration schemes for the molecular dynamics equations 

can be considered to be sequences of elementary steps in which cither all momenta 
n(x, fi) or all link variables U {x, (x) are updated. For a step of size e, these elementary 
operations are 

Tfc(e): Yl{x,^l)^Il{x,^l)-eFk{x,^JL), A; = 0,...,n, (2.14) 

T[/(e) : U{x, //) ^ E (en(x, //)) U{x, //), (2.15) 
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where E(X) stands for the SU(3) exponential function or some suitable approxima- 
tion to it (see appendix A). The leap-frog integration from molecular dynamics time 
to r, for example, amounts to the application of the product 

{To(|e) . . . T„(|e)Tt;(e)To(ie) . . . Inih)}'' > ^ = (2-16) 

to the initial field configuration. As is well known, this popular integrator converges 
with a rate proportional to and satisfies the basic theoretical requirements, i.e. it 
preserves the measure in phase space and is exactly time-reversible. 

If the forces Fq, . . . ,-F„ have sizeably different magnitudes such that, say, ||Fo|| is 
larger than ||i^i||, which is larger than 11^2 1|) and so on, it is possible to accelerate 
the integration by adopting a hierarchical scheme [26,27]. First an integrator 

Mr, No) = {To(|e)T[/(e)To(ie)}'^° , e = (2.17) 

iVo 

is defined, which neglects all quark forces Fi, . . . , Fn. More complicated integrators 
that include an increasing number of these forces are then constructed recursively 
through 

3k{T,No,...,Nk) = 

{lkihPk-iie,No,...,Nk-i)1k{h)}''\ e=^. (2.18) 

The full integrator 3t„(r, A^O; • • • > A„) that is obtained in this way is characterized 
by the time intervals 

T 

''^ " Afc Afc+i . . . A„ 

at which the forces F^ need to be evaluated. 

In practice the step numbers Nk must be such that the shifts eoll and ekFk remain 
small along the molecular dynamics trajectory. The smaller the forces are the larger 

the step sizes can be, and whether a reduction of the computational effort is achieved 
thus depends on the magnitudes of the forces and the difficulty to compute them. 

2.4 Choosing a preconditioner 

In general the preconditioning of the HMC algorithm is beneficial because it leads 
to smaller quark forces along the molecular dynamics trajectories. Ideally the pre- 
conditioning should be such that the forces that are relatively expensive to calculate 
(in terms of the required computer time) are those with the smallest magnitude. 
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(2.19) 



Preconditioning may also soften the scaling behaviour of the simulation algorithm 
with respect to the quark mass. When the chiral limit is approached, the quark 
forces tend to increase, and probably become more irregular too, which requires 
the step sizes to be adjusted accordingly. In a recent study of two-flavour QCD at 
small quark masses by Namekawa et al. [11], for example, the step sizes had to be 
decreased roughly in proportion to the quark mass. The choice of the preconditioner 
can have an influence on this behaviour, as will be shown later in this paper. 

Eventually the performance of the preconditioned algorithm must be determined 
empirically, since the dependence of the autocorrelation times of the quantities of 
interest on the preconditioner is difficult to foresee. The suitability of the algorithm 
for parallel processing and other practical issues may also have to be taken into 
account at this point. 



3. Schwarz-preconditioned algorithm 

The classical Schwarz alternating procedure was previously considered as a precon- 
ditioner for the Wilson-Dirac equation [20]. When combined with the generalized 
conjugate residual (GCR) algorithm, this preconditioner proved to be very efficient, 
and it will now be used, in its simplest form with cycle number ricy = 1, to precon- 
dition the HMC algorithm. 

3.1 Domain decomposition 

Following rcf. [20] the lattice is covered by a regular grid of non-overlapping rect- 
angular blocks A. For technical reasons the block sizes are assumed to be even and 
to be such that the blocks can be chessboard-coloured, i.e. that there is an even 
number of blocks in all dimensions. The union of the black blocks is then denoted 
by VL and the union of the white blocks by 17* (see fig. 1). 

With respect to an ordering of the lattice points where those in $7 come first, the 
Wilson-Dirac operator assumes the block form 



The operator Dq^^ for example, coincides with the Wilson-Dirac operator on VI with 
Dirichlet boundary conditions, while Dqq^ is the sum of all hopping terms from the 
exterior boundary dO. of Q. to the boundary dO.* of O*. 




(3.1) 
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Fig. 1. Two-dimensional cross-section of a 24 x 12^ lattice covered by non-overlap- 
ping 6"* blocks A. The domains f2 and il* are the unions of the black and white blocks 
respectively, and their exterior boundaries dQ and dCl* consist of all points in the 
complementary domain represented by open circles. 

It is often convenient to let these operators act on quark fields that are defined on 
the whole lattice rather than on $7 or J7* only. The extension is done in the obvious 
way by padding with zeros so that eq. (3.1), for example, may be written as 

D = Dii + Dn'+ Don + Doa* . (3.2) 

Similarly the further decompositions into block operators read 

Dn + Dn* = Y^ L»A, (3.3) 

all A 

Dan = Yl ^3^' ^^n- = Yl ^^k, (3.4) 

black A white A 

where Da denotes the Wilson-Dirac operator on the block A with Diriehlet boundary 
conditions and D^a the sum of the hopping terms that move the field components 
on the exterior boundary dk of the block A to its interior boundary points. 

3.2 Quark determinant 
The factorization 

detL> = detL»ndetL»n* det |l - D^^DenD^lDsn*] (3.5) 

is now deduced from the block structure (3.1) as in the case of the even-odd precon- 
ditioning considered in subsect. 2.1. However, contrary to what might be suspected, 
the operator in the curly bracket is not quite the same as the Schwarz-preconditioned 



Wilson-Dirac operator of ref. [20], with cycle number ncy = 1, but their determi- 
nants coincide and the factorization (3.2) may therefore be regarded as the one that 
is naturally associated to the Schwarz preconditioner. 

Some further reductions and factorizations of the quark determinant are still pos- 
sible at this point. The decomposition (3.3), for example, leads to the identity 

det Dn det Dn- = JJ det -Da, (3.6) 

all A 

in which L'a stands for the cvcn-odd preconditioned Wilson Dirac operator on the 
block A with Dirichlct boundary conditions. Another observation is that the oper- 
ator in the curly bracket in cq. (3.5) (the Schur complement) acts non-trivially on 
only those components of the quark fields that reside on dU*. Its determinant can 
therefore be reduced to the space of all fields supported on this subset of points. As 
explained in appendix B, a reduction to an even smaller subspace Vqq* of fields is 
in fact possible when the detailed properties of Dg^' are taken into account. 

Contact with the general form of the preconditioned HMC algorithm discussed in 
the previous section is now made by setting n = 2 and 

R, = Y,Da, (3.7) 

all A 

i?2 = 1 - Pan-D^^Da^D^lDdQ* , (3.8) 

where Pan* denotes the orthonormal projector to the subspace Van* (appendix B). 
In particular, i?2 is considered to be an operator in this space, while Ri operates on 
quark fields that are defined on the even points of the full lattice. 

3.3 Block decoupling 

The classical Schwarz procedure obtains the solution of the Wilson-Dirac equation 
in an iterative process, where all black and all white blocks are visited alternately 

and the equation is solved there with Dirichlet boundary values given by the current 
approximation to the solution [20]. Since the lattice Dirac operator involves only 
nearest-neighbour hopping terms, the equations on the black blocks (and similarly 
those on the white blocks) are completely decoupled from each other and can be 
solved in parallel. 

In the case of the preconditioned HMC algorithm, a partial decoupling of the fields 
on the blocks can be achieved by restricting the molecular dynamics evolution to a 
subset of all link variables, referred to as the active link variables, while keeping all 
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Fig. 2. Two-dimensional view of a G"* block showing the links on which the active 
Unk variables reside. Such links must have both endpoints in the same block and at 
most one endpoint on the interior boundary of the block (open circles). 

other field variables fixed (see fig. 2). At the beginning or the end of every update 
cycle, the gauge field should then be translated by a random vector v, 

U{x, fj,) ^ U{x + V, fj,) for all x,fx, (3-9) 

to ensure that all link variables are treated equally on average. Evidently, the same 
can also be achieved using several block coverings of the lattice alternately, but this 
tends to be rather more complicated from the programming point of view. 

A moment of thought now reveals that the active link variables in different blocks 
are decoupled from each other if the last term in the hamiltonian (2.5) (the one 
involving the operator R2) is neglected. The inner integrator CJi(e2, Nq, Ni) therefore 
factorizes into a product of integrators, one for each block, which evolve the active 
link variables residing there as if QCD would be reduced to that block. In particular, 
these parts of the integration can be carried out in parallel. 

More important perhaps is the fact that the restriction to the active link variables 
leads to a reduction of the magnitude of the quark force that derives from the last 
term in the hamiltonian. This effect was anticipated in ref. [19] and will be discussed 
again in subsect. 4.2. 

3.4 Update cycle 

The generation of the next gauge-field configuration U' from the current configura- 
tion U thus proceeds in the following steps: 

(a) First the initial momenta n(x,/x) and the pseudo-fermion fields (pi{x) and 4'2ix) 
must be generated, with conditional probability proportional to . It suffices to 
generate the momenta of the active link variables, since only these will be updated. 
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The field (f)i breaks up into block fields, 



01 = ^ ^A, = -Da??a, (3.10) 

all A 

where r]\ is a gaussian random field supported on the even sites of block A, and the 
field (f)2 is similarly given in terms of a gaussian random field 772 through 

02 = R2V2- (3.11) 

Both 02 and r/2 are in the linear space Vqq* of boundary fields (appendix B). 

(b) The molecular dynamics equations for the active link variables and their mo- 
menta must now be integrated from time to time r, the trajectory length. In the 
studies reported in this paper, the integration was performed by applying the inte- 
grator 32iT, -^0) ^2) with some fixed step numbers Nq, Ni, N2- The computation 
of the forces Fq and Fi, which is required in this process, is entirely standard, while 
in the case of the force F2 the key point to note is that 

R^^ = 1 - Pea-D-^Dga'. (3.12) 

In particular, for all variations a; of the active link variables, 

{io,F2) = 2Re{R^'<l>2,Pdn*D-^5^DD-^Dda^cl)2) , (3.13) 

and the calculation of the force thus amounts to solving the Wilson-Dirac equation 
on the full lattice for two source fields. As already mentioned in sect. 2, any suitable 
preconditioner and solver can be used at this point. Here the Schwarz-preconditioncd 
GCR solver described in ref. [20] was employed, with its own block grid chosen so 
as to minimize the average execution time. 

(c) Once the molecular dynamics equations arc integrated, the final configuration is 
accepted as the next one with probability 

Pacc = min{l,e-^-^}, AH = H' - H, (3.14) 

where H and H' are the values of the hamiltonian of the initial and the final config- 
uration respectively. Otherwise, i.e. if the proposed configuration is rejected, the old 
configuration is taken to be the next one. In both cases the selected configuration 
is translated by a random vector v, as discussed above, where v should be chosen 
with some care (see appendix C). 
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3. 5 Choice of the block sizes 

In principle any division of the lattice into non-overlapping blocks is acceptable, but 
the algorithm tends to become inefficient if the blocks are only a few lattice spacings 
wide, because in this case only a small fraction of the link variables get updated in 
each cycle. When the lattice is divided into 6"^ blocks, for example, the active link 
variables are a minority of 25%, and this percentage rises only slowly with the block 
size, reaching values greater than 50% when 12^ and larger blocks are used. 

Block sizes larger than Ifm or so should also be avoided, however, because the 
Dirichlet boundary conditions then no longer provide a safe infrared cutoff on the 
spectrum of the block operators Z)a- The intended separation of the high and low 
modes of the Wilson-Dirac operator through the preconditioning is thus compro- 
mised and the algorithm slows down since the computation of the force Fi becomes 
relatively time-consuming. 

To sum up, the block sizes should be less than 1 fm, but as large as possible within 
this limit so as to maximize the fraction of the active link variables. 



4. Tests of the algorithm 

All simulations reported in this section were carried out on 8 nodes (16 processors) 
of a recent PC cluster, the same as the one described in appendix B of ref. [20]. The 
general programming strategies discussed in that paper were adopted here too and 
some further implementation details are given in appendix D. 

4-1 Run parameters 

Extensive simulations of two-flavour lattice QCD with the gauge and quark actions 
chosen here were previously performed by the SESAM and the TxL collaborations 
[1-4] and, more recently, in the framework of the GRAL project [5,6] using the even- 
odd preconditioned HMC algorithm or the plain HMC algorithm with an 11-SSOR 
preconditioner [28] for the solver. Since most results of these studies were obtained 
at coupling f5 = Q/qq = 5.6, it was decided to carry out the tests of the Schwarz- 
preconditioned algorithm at this value of the coupling. Two lattices and three values 
of the hopping parameter «; = (8 -|- 2mo)~^ were considered, the one corresponding 
to the smallest quark mass being a new point (see table 1; for notational convenience 
the three runs at the smallest mass on the 32 x 24^ lattice are distinguished by an 
upper index on k). 
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Table 1. Simulation parameters 



Lattice 


Block size 


K 


No,Ni,N2 




p 

^ acc 


32 X 163 


84 


0.15750 


4,5,4 


8000 


0.81 






0.15800 


4,5,5 


9100 


0.86 






0.15825 


4,5,6 


9400 


0.90 


32 X 243 


8 X 6^ X 12 


0.15750 


4,5,5 


4000 


0.82 






0.15800 


4,5,6 


3950 


0.80 






0.15825'^ 


4,5,7 


800 


0.71 






0.15825^ 


4,5,8 


2300 


0.78 






0.15825'= 


4,5,10 


1100 


0.87 



In physical units the lattice spacings on these lattices decrease monotonically with 
the quark mass from 0.085 fm to about 0.078 fm [3] if the Sommer radius tq = 0.5 fm 
[29] is used as the basic reference scale f. The bare current-quark masses are then 
approximately 63,35 and 21MeV (see subsect. 4.6). Cutoff effects can still be fairly 
large at these lattice spacings, and a more meaningful statement about the quark 
masses would evidently have to include the appropriate renormalization factors, but 
these figures nevertheless provide a rough characterization of the physical situation 
at the simulation points. 

Following ref. [1] the trajectory length r was set to 0.5 in all cases. The mean dis- 
tance in group space covered by the active link variables in the course of a trajectory 
is then fairly large already (about a third of the maximal distance). Initial runs of a 
few thousand trajectories were made on all lattices to ensure a safe thermalization, 
and ATtr trajectories were generated thereafter, as quoted in table 1. 

^.2 Magnitude of the forces 

To a first approximation it seems reasonable to fix the step numbers A^o ■, 1 ^2 in 
such a way that the increments eoi^O; ei-^^ij ^2-^2 are roughly of the same size. The 
magnitude of the forces was determined in the test runs, and the results of these 
calculations on the smaller lattices are shown in fig. 3. For the norm of an element 

f If the mass of the p meson was used instead, shghtly larger values would be obtained. However, 
since the p meson is a resonance and since its mass depends quite strongly on the quark mass, this 
way of setting the scale tends to be somewhat ambiguous in full QCD and subject to potentially 
large finite-size effects. 
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Fig. 3. Average magnitude of tlie forces Fk at the end of the molecular dynamics 
trajectories on the 32 x 16^ lattice at distance d from the block boundary. Dotted lines 
are drawn to guide the eyes, and the three curves for F2 correspond to the different 
quark masses (the mass dependence of the other curves is negligible). 

X of the Lie algebra of SU(3), the convention = — 2tr{X^} was adopted here, 

and an average was taken over all gauge configurations and all links at the specified 
distance from the block boundary. 

The magnitudes of the forces are thus strongly ordered, approximately like 32 : 
8:1, and II-F2II decreases rapidly as one moves towards the centre of the block. 
Moreover, a significant mass dependence is only seen in this force and only at the 
links where it is already very small. This behaviour can be understood by noting 
that the expression on the right of eq. (3.13) involves two quark propagators from 
the boundaries dQ* and dQ to the link where the force is evaluated. The decay 
of the force away from the block boundary is then essentially a consequence of the 
corresponding property of the propagators. 

Figure 3 suggests to set A^q = 4, A^i = 8 and to adjust N2 so that a good acceptance 
rate is achieved. Some experimenting showed, however, that Ni = 5 appears to be 
a better choice on the lattices that were simulated. For the values of A^2 quoted in 
table 1, the step size €2 was then in the range 0.05 — 0.13 and the average magnitude 
of the increments e^Ffc was less than 0.031 in all cases. 

4-3 Residues & reversibility 

The computation of the quark forces Fi and F2 requires the solution of the Wilson- 
Dirac equation on the blocks A and on the full lattice, for two source fields in each 
case. On the full lattice, the Schwarz-preconditioned GCR solver [20] was applied 



13 



two times to the Wilson-Dirac equation directly, while on each block A the solutions 
were obtained simultaneously by solving the normal equation 



Al/j = T], 



(4.1) 



using the conjugate-gradient (CG) algorithm. In this case the algorithm was stopped 
when the approximate solution ijj satisfied 



for some specified tolerance ri. The analogous stopping criterion was applied, with 
tolerance r2, when the equation on the full lattice was solved. 

At the beginning and the end of the molecular dynamics trajectories, the Wilson- 
Dirac equation must be solved a few more times, both on the blocks and on the full 
lattice, with tolerances fi and f2 respectively. The BiCGstab algorithm [30,31] was 
used here to solve the block equations, since the solution is required for one source 
field only so that the application of the CG algorithm would be wasteful. 

The tolerances chosen in the test runs. 



are sufficiently small to guarantee the reversibility of the molecular dynamics trajec- 
tories to high precision. The maximal absolute deviation of the components of the 
link variables, for example, which was ever observed after a return trajectory was 
only 7 X 10~^ on the smaller and 7 x 10""^^ on the larger lattices, while in the case of 
the hamiltonian the differences were less than 3 x 10~^ and 5 x 10~^, respectively. 

4-4 Stability 

It is well known that instabilities in the numerical integration of the molecular dy- 
namics equations may occur at small quark masses, which lead to violent fluctuations 
in AH and to a stagnation of the algorithm in extreme cases. Choosing smaller in- 
tegration step sizes usually cures the problem, but also increases the computer time 
required per trajectory (see refs. [8,11] for example). 

In the tests of the Schwarz-preconditioned algorithm, severe integration instabili- 
ties were only rarely seen in spite of the fact that the step size €2 was set to relatively 
large values. A small algorithmic modification, referred to as the replay trick [32], 
was nevertheless included in order to safely avoid periods of stagnation. Essentially 



77 — AipW < Ti \\r] 



(4.2) 




10"'^, 10"^, 10-1°, 10-9 on the 32 x 16^ lattices, 
10-^ 10-^, 10-i\ 10-1° on the 32 x 24^ lattices. 



(4.3) 
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Fig. 4. Histories of AH and of the deviation AP of the average plaquette from 
its mean value at k = 0.15800 on the 32 x 24^ lattice as a function of the trajectory 
number. Only the values after every 5th trajectory are plotted. 

the modification amounts to replaying the trajectories with values of \AH\ above a 
specified threshold, using a smaller step size and a somewhat complicated acceptance 
rule that preserves detailed balance. 

For illustration, a typical time series of AH (after including the replay trick) is 
shown in fig. 4. The replay threshold was set to 1.5 in this case and the overhead in 
computer time that was caused by the trajectory replays was about 2%. Also shown 
in the figure are the fluctuations of the average plaquette 

p p 

where the sum runs over all Np plaquettes on the lattice and Up denotes the ordered 
product of the link variables around the plaquette p. As will be discussed shortly, 
these fluctuations are quite correlated, but otherwise do not seem to have any special 
features. In particular, the associated histogram has a nearly gaussian shape. 

The fact that the (massive) Wilson-Dirac operator is not protected from having 
arbitrarily small eigenvalues was always a source of concern and may well be the 
cause for the integration instabilities [11]. An interesting quantity to consider in this 
connection is the average number (iVccR) of Schwarz-preconditioned GCR iterations 
that are required for the solution of the Wilson-Dirac equation on the full lattice 
along a given trajectory. While catastrophically large iteration numbers were never 



15 




Fig. 5. Probability distribution of the average iteration number {Nqcr) per appli- 
cation of the Schwarz-preconditioned GCR solver along a given trajectory. From left 
to right the results obtained on the 32 x 24^ lattice at k = 0.15750, 0.15800, 0.15825*' 
are shown. 



observed, the distributions shown in fig. 5 have a tail that extends to fairly high 
values of (A'^qcr) Sit the smallest quark mass. The replay overhead on the 32 x 24^ 
lattice was actually as large as 20-30% at this mass, except in the last run where 
the replay threshold was set to 3.0 and no trajectories were replayed. 

4-5 Autocorrelation times 

An estimation of the relevant autocorrelation times is clearly essential in the present 
context, since the efficiency of the algorithm can only be determined when these are 
known. Unfortunately the available statistics is insufficient for a solid "error of the 
error" analysis [33], and the values for the integrated autocorrelation times quoted 
below should therefore be taken as first estimates only. 

The autocorrelation functions of the average plaquette P and the average number 
{Nqck) of GCR iterations are plotted in fig. 6. An approximation to the four- 
point autocorrelation function introduced by Madras and Sokal [34] was used here 
to determine the statistical errors (see appendix E). As usual the associated inte- 
grated autocorrelation times are obtained by summing the autocorrelation functions 
up to some maximal time lag, referred to as the summation window. The particu- 
lar prescription adopted here was to truncate the sums at the first point where the 
autocorrelation function vanishes within errors. In general this rule yields consistent 
results and summation windows a few times larger than the calculated autocorrela- 
tion times. 

The values of the integrated autocorrelation times listed in table 2 seem to be quite 
a bit higher than those obtained in previous studies of two-flavour QCD, but the 
comparison is not straightforward and must in any case remain uncertain to some 
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Fig. 6. Normalized autocorrelation functions of the average plaquette P (dark error 
band) and the average number (A^gcr) of GCR iterations (grey band). In the first 
and the second row of plots the lattice size is 32 x 16^ and 32 x 24^ respectively, and 
the hopping parameters increase from left to right as in fig. 5. 



extent in view of the large statistical errors (see subsect. 4.7). It should also be 
noted that the solver iteration numbers are usually a worst case, and more physical 
quantities (hadron masses in particular) typically have much smaller integrated auto- 
correlation times. 

Whether there is a systematic trend in the autocorrelation times as a function 
of the quark mass or the lattice size is difficult to tell from the figures listed in 
table 2. There is, however, no indication that the autocorrelation times grow when 
the quark mass is lowered. Some of the runs were perhaps a bit too short to safely 
exclude biased results, and longer simulations may be required to firmly establish 
the decrease of the autocorrelation times as a function of the step size €2 suggested 
by the last three entries in the table. 



4-6 Quark and pion masses 

In this algorithmic study, the computation of the current-quark mass and the pion 
mass is not of central interest, but it helps to determine the physical situation on 
the simulated lattices, as discussed at the beginning of this section. 

Since a very accurate determination of the masses is not required, error reduction 
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Table 2. Integrated autocorrelation times * 



Lattice 


K 


ri„t[P] 


ri„t[(A^GCR)] 


32 X 16^ 


0.15750 


68(25) 


168(42) 




0.15800 


32(7) 


162(56) 




0.15825 


57(18) 


135(39) 


32 X 24^ 


0.15750 


53(22) 


144(51) 




0.15800 


33(11) 


122(36) 




0.15825=^ 


33(13) 


84(24) 




0.15825'^ 


19(5) 


65(20) 




0.15825^= 


12(4) 


22(6) 



* In numbers of trajectories of length r = 0.5 



techniques were not applied and the masses were obtained straightforwardly from 
the two-point functions of the isospin axial current and density. In terms of the 
quark field tp and the antiquark field ijj, the latter are given by 

a; = V^7M75^r>, = ^Ts^T-^V, (4.5) 

where r^,r^,r^ denote the isospin Pauli matrices. As usual the two-point functions 
were evaluated at vanishing spatial momentum, 

J2 <P"(x)P^O)> ^ i<5"Vpp(^o), (4.6) 
^ <Ag(x)P^(0)> = i(5«VAp(xo), (4.7) 

and the appropriate signed average of the calculated functions at time xq and T — xq 

was taken. 

In the test runs, the gauge-field configurations generated after every 100 update 
cycles were saved to disk, allowing physical quantities such as the quark and the pion 
mass to be calculated later. The fields in these ensembles are only weakly correlated, 
and no systematic increase in the naive statistical errors was in fact observed when 
the measured values of the two-point functions /pp(a;o) and /Ap(a^o) were averaged 
over small bins of successive configurations. For the mass calculations the values of 
the two-point functions were therefore assumed to be statistically independent and 
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Table 3. Average plaquette, current-quark mass and pion mass 



Lattice 


K 


P 


am 




32 X 16^ 


0.15750 


0.57255(6) 


0.0279(4) 


0.298(5) 




0.15800 


0.57335(4) 


0.0153(4) 


0.242(4) 




0.15825 


0.57387(5) 


0.0092(4) 


0.209(7) 


32 X 24^ 


0.15750 


0.57250(4) 


0.0273(4) 


0.280(3) 




0.15800 


0.57344(3) 


0.0143(3) 


0.188(5) 




0.15825^'^ 


0.57383(2) 


0.0084(4) 


0.153(4) 



the results were checked by repeating the calculations with the data averaged over 
bins of 2 or 3 measurements. 

The conservation of the axial current in the continuum limit implies that the ratio 

{/ap(xo + 1) - /ap(xo - l)}{4/pp(xo)}-' (4.8) 

is independent of xq and equal to the bare current-quark mass m, up to terms that 
vanish proportionally to the lattice spacing. In the range 8 < xq < 16 the data 
for the ratio were actually found to be constant within errors. The quark mass was 
then extracted through a correlated least-squares fit of the ratio in this range, using 
jackknife error estimates for the covariance matrix. 

The pion mass m^^ was calculated in the same way by fitting the effective mass 
fnesixo) at large times xq. At any given xq, the effective mass is obtained by solving 
the equation 

h{xo) /pp(a;o) 

for M and setting m^sixo) = M. The associated covariance matrix was again cal- 
culated using jackknife bins and the fit ranges were chosen so that a good fit quality 
was obtained. 

Where a comparison is possible, the results listed in table 3 coincide with those 
published by the SESAM, T^L and GRAL collaborations [1-6] within 1 or 2 sigma, 
an exception being a deviation by 3 sigma of the pion mass at k = 0.15750 on the 
32 X 16"^ lattice. The quark masses change only slightly with the lattice size, as it 
should be close to the continuum limit, while the size-dependence of the pion mass 
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Fig. 7. Average number (A'cf) of accepted gauge- field configurations generated per 
day by the Schwarz-preconditioned HMC algorithm on 8 nodes of a recent PC cluster 
(described in ref. [20], appendix B), as a function of the bare current-quark mass. The 
point at the smallest mass on the 32 X 24^ lattice is from the last run in table 1 and 
the linear scaling curves (dotted lines) are drawn for comparison with the data. 



is not small on these lattices [5,6]. In physical units the quark and the pion masses 
quoted on the last line of table 3 are about 21 MeV and 386 MeV respectively (if the 
scale is set through the Sommer radius). 

4-7 Speed and efficiency of the algorithm 

In practice the average number of accepted gauge-field configurations generated per 
day on the computer that is used for the simulations is a relevant performance figure. 
As can be seen from fig. 7, the experience made with the Schwarz-preconditioned 
HMC is quite encouraging in this respect, given the fact that the tests were per- 
formed on a relatively small machine. In the most time-critical parts of the program, 
the machine delivered about 26 Gflop/s [20], and the total computer time spent for 
the tests was a little more than half a year. 

At the smallest quark mass that was considered, most of the time is used for the 
solution of the Wilson-Dirac equation on the full lattice. The situation is different 
at the larger quark masses, where the solver converges faster while the time spent 
in the other parts of the program is nearly unchanged. For this reason, and since 
the step number A^2 had to be adjusted slightly, the configuration production rate 
does not follow a simple scaling law with respect to the quark mass. In particular, 
the three lower points in fig. 7 are only accidentally fitted by the dotted line. 

An interesting measure for the efficiency of the Schwarz-preconditioned algorithm 
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Table 4. Values of the cost figure v 



Lattice 


n 


u 


32 X 16^ 


0.15750 


0.75(28) 




0.15800 


0.42(9) 




0.15825 


0.86(27) 


32 X 24^ 


0.15750 


0.69(29) 




0.15800 


0.50(17) 




0.15825^ 


0.56(22) 




0.15825*^ 


0.36(10) 




0.15825<= 


0.28(9) 



is provided by the cost figure 

z. = 10-=^(2A^2 + 3)ri„t[P]. (4.10) 



On average, this is the number of times the Wilson-Dirac equation must be solved 
on the full lattice, in units of thousands, before a statistically independent value of 
the average plaquette P is obtained. Evidently v is independent of the machine, the 
program and the particular solver which arc used. The numbers listed in table 4 
show that the algorithm is remarkably stable in these terms, and there certainly is 
no indication, in the parameter range considered and within the errors quoted, that 
it would slow down towards smaller quark masses or larger lattices. 

If another or no preconditioner is used for the HMC algorithm, the cost figure u 
may be defined in the same way, with replaced by the appropriate step number. 
In refs. [1-6], for example, the plain HMC algorithm was studied at the same gauge 
coupling and hopping parameters k = 0.15750 and k = 0.15800 which were consid- 
ered here. On the 32 x 16^ lattice the corresponding values of v, as calculated from 
the run parameters and autocorrelation times quoted in ref. [6], are equal to 1.0(6) 
and 2.8(8) respectively. These numbers rise to 1.8(8) and 5.1(5) on the 40 x 24^ 
lattice, which shows quite clearly that the efficiency of the plain HMC algorithm 
decreases when the lattice size is increased or if the quark mass is taken to smaller 
values. In particular, already at k = 0.15800 (where the pion mass is approximately 
465 MeV) the Schwarz-preconditioned algorithm is much more efficient. 

Although different lattice formulations were chosen in most cases, it is interesting 
to note that large values of u are common to all previous simulations of two-flavour 
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QCD at small quark masses. In a study conducted by the UKQCD collaboration [8], 
for example, using the even-odd preconditioned HMC algorithm, the cost figure on 
a 32 X 16'^ lattice at ~ 420 MeV was 5.5(11). Significantly smaller masses were 
reached by the CP-PACS collaboration on a coarse 24 x 12'^ lattice [11], and in these 
simulations v ranged from 9.4(18) at m-„ ~ 398 MeV to 29(10) at m^r 255 MeV. 



5. Concluding remarks 

The Schwarz-preconditioned HMC algorithm is intended for simulations of lattice 
QCD at small lattice spacings and small quark masses. It is particularly well suited 
for parallel processing, which can be a decisive advantage when the continuum limit 
is approached and very large lattices have to be simulated. Many of the theoretically 
expected properties of the algorithm were confirmed in the tests reported in this 
paper, and although important uncertainties remain, there are clear indications that 
the algorithm does indeed have a favourable scaling behaviour with respect to the 
quark mass. 

So far only the standard Wilson formulation of lattice QCD was considered, but 
0(a) improvement and double-plaquette terms can be easily included in the Schwarz- 
preconditioned algorithm. Whether its excellent scaling properties will be preserved 
is difficult to foresee, however, because the autocorrelation times are unpredictable. 
Independently of the lattice formulation, the algorithm is not expected to perform 
particularly well at large quark masses or at lattice spacings larger than 0.1 fm or 
so. The problem on coarse lattices simply is that all choices of the block sizes tend 
to be inappropriate from one or the other point of view (cf. subsect. 3.5). 

The particular Schwarz preconditioner that was considered here is only one exam- 
ple of a domain decomposition preconditioner. Close to the continuum limit more 
complicated preconditioner s, based on a hierarchy of block grids perhaps or alter- 
native domain decompositions, may conceivably be more efficient. The potential for 
further accelerations along these lines is limited, however, until significantly larger 
step numbers are required than those quoted in table 1. 

I am indebted to Rainer Sommer for many helpful discussions on QCD simulation 
algorithms and to Ulli Wolff for correspondence on the analysis of simulation time 
series. Thanks also go to Boris Orth for sending a copy of his PhD thesis prior to 
publication, and to him and Thomas Lippert for correspondence on the SESAM, 
T^L and GRAL projects. The simulations reported in this paper were carried out 
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on a PC cluster at the Institut fiir Theoretische Physik der Universitat Bern, which 
was funded in part by the Schweizerischer Nationalfonds. I thank both institutions 
for the continuous support given to this project. 



Appendix A. SU(3) exponential function 



In the course of the numerical integration of the molecular dynamics equations, the 
link variables arc updated by left-multiplication with a matrix function E(X), where 
X is proportional to the field momentum at the given link. E(X) is usually taken 
to be the exponential function in SU(3), but there are alternative choices that are 
theoretically equally acceptable and easier to implement. 

To ensure that the leap-frog integration will be well-defined, reversible and quadra- 
tically convergent, the following properties must hold: 

(1) E(X) is a smooth mapping from the Lie algebra of SU(3) to the group. 

(2) The function satisfies E(X)E(-X) = 1 for all X. 

(3) At small t the asymptotic expansion E(iX) = l + tX + ^t'^X^ + O(i^) holds. 
All these requirements can be met by setting 



E{X) = UiU2U3U2Ui, (A.l) 

where Ui, U2 and U3 are certain SU(2) rotations in the (12), (13) and (23) subspaces 
respectively. They are obtained by representing the anti-hermitian traceless matrix 
X as a sum of the three matrices 



xi X12 0' 





X2I 




-xi 




{ 









f X2 





Xi3 




















-X2 

















X3 


X23 




Vo X32 


~X3 



xi = k (Xii - X22) , (A.2) 



X2 = h {Xn - X33) , (A.3) 



X3 = k {X22 - X33) . (A.4) 
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Evidently, these are generators of the SU(2) subgroups just mentioned, and it is 
then straightforward to prove that 

c/i = (i + iyi)(i-iyi)-\ (A.5) 

U2 = {l + lY2){l-\Y2y\ (A.6) 

Us = + '^Ys) [1 - '^Ys)-' (A.7) 

is a possible choice of the matrices Uk- The inversions in these equations can be 
worked out analytically and are numerically safe, since the l^'s have purely imag- 
inary eigenvalues. In particular, it is straightforward to write a fast program that 
computes E(X) for any given X practically to machine precision. 



Appendix B. Boundary fields 

In this appendix the space Vgn* of boundary fields is specified explicitly. The choice 
of this space is actually not unique, but it should be such that the associated or- 
thonormal projector Pan* satisfies 

Dan- Pan* = Don-. (B.l) 

This property guarantees that the determinant of the Schur complement in eq. (3.5) 
coincides with the determinant of the projected operator R2, which is all what is 
assumed in sects. 3 and 4. 

The action of the operator Dqq* on an arbitrary quark field i/j{x) is given by 

3 

Ddn-ipix) = -en*{x)^{^{l-j^,)ea{x + fi)U{x,fi)'il;{x + il) 

IJ.=0 

+^ (1 + 7m) - fi)U{x - (i,^i)-^^l^{x - /i)}, (B.2) 

where 9q,{x) and 6n*{x) are the characteristic functions of CI and Q*. As illustrated 
by fig. 8, the terms on the right-hand side of this equation move the Dirac spinors 
from dQ* to 50 and multiply them with the projectors |(1±7^) and the appropriate 
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Fig. 8. The hopping terms in Dg^i* move the quark spinors from the interior to the 
exterior boundary points of the black blocks. Depending on the position of the inner 
point, the spinor is moved in one or more directions. 

link variables. In particular, two components of the spinors residing on the subset 

[50*] = {x G dfl* \ X + fi e dCl or x — /t G dfl for one value of fi only} (B.3) 

are lost irretrievably through the application of the projectors. 
It is now straightforward to show that the operator 



(0 ifx^dn*, 

|(l + 7^)V'(a;) a X e [dn*] and X + fL e dn, 

^(1 - 7^)'0(a;) if X G [dQ*] and x - fi e dft, 

, ip{x) otherwise. 



(B.4) 



satisfies cq. (B.l). Moreover, this choice excludes the trivial nullspace of Dqq* and 
thus minimizes the dimension of the space Vga* of boundary quark fields. 



Appendix C. Translation vectors 



The field translations at the end of the update cycles of the Schwarz-preconditioned 
algorithm should ideally be such that all link variables are visited at approximately 
the same rate. In particular, long gaps between subsequent updates of a given link 
variable should be avoided as far as possible. One might think that a purely random 
choice of the translation vectors v will be acceptable from this point of view, but 
the visiting frequency on some links can actually be far below average in this case, 
even after hundreds of update cycles. 
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Fig. 9. Three copies of a block grid in two dimensions, two of them being translated 
along a diagonal direction at regular distances. Taking the boundary conditions into 
account, the lattice is completely covered by the blocks in these grids. 

More balanced visiting frequencies can be obtained by adopting a quasi-random 
scheme, where a purely random choice of v is followed, in the next few update cycles, 
by translations along a diagonal lattice direction in regular steps (see fig. 9). If the 
lattice is divided into blocks of size 8"^, for example, performing a random translation 
and then three times a translation by the vector 

« = (2,2,2,2) (C.l) 

yields a fairly narrow visiting distribution. For general block sizes 6^, it suffices to 
take n — 1 regular steps by the vector 

v = {hoMMM)/n (C.2) 

(or close-by integer vectors), where n = \ min{10, 6o, ^'i, ^s}- 



Appendix D. Programming issues 

The programming of the Schwarz-preconditioned algorithm is not completely trivial, 

particularly so on parallel computers, where the block decomposition of the lattice 
gives rise to additional complications. Object orientation and generic programming 
are useful general strategies that may be applied here [20], and some more specific 
issues will now be addressed. 

D.l Parallelization 

Clearly the communication overhead is minimized if each block in the domains Q and 
n* is fully contained in one of the sublattices on which the processors of a parallel 
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computer operate. This was the case in the test runs reported here, but more flexible 
assignments may eventually be needed, where the processor sublattices divide the 

blocks in some way. 

On machines with symmetric multiprocessing nodes, for example, the natural 
choice is to contain the blocks in the node sublattices. In this case the processors 
can operate on the block fields in a shared-memory mode and the network is only 
used when the global fields get involved. In general the communication overhead can 
always be reduced by dividing the blocks into smaller blocks that are fully contained 
in the processor sublattices. The Schwarz-preconditioned GCR solver may then be 
set up on this block grid and be used for the solution of the Wilson-Dirac equation 
on both the full lattice and the big blocks. 

D.2 Block fields 

In the program a block may be defined through a data structure that contains the 
relevant geometrical information (embedding and nearest- neighbour indices), the 
local gauge field and a set of quark fields. An obvious advantage of this approach 

is that the operations on these fields can proceed locally without having to gather 
and scatter any data from and to the global fields. With a proper layout of the local 
fields in memory, streaming data accesses, and hence a higher processing speed, are 
thus made possible. 

Most computations are actually local operations, an exception being the calcula- 
tion of the force F2, which requires the solution of the Wilson-Dirac equation on 
the full lattice. The current values of the gauge fields on the blocks may need to be 
copied to the global fields at this point, but this is a minor complication, since the 
fields must be copied only once per force calculation. 

D.3 Single-precision acceleration 

In the course of the numerical integration of the molecular dynamics equations, an 
exponential amplification of rounding errors is possible and important significance 
losses are in any case unavoidable when the difference AH is calculated at the end 
of the trajectories. The algorithm was therefore implemented using double-precision 
(64 bit) data and arithmetic for all fields. 

Through the intermediate use of single-precision data and arithmetic, the CG and 
the Schwarz-preconditioned GCR solvers for the Wilson-Dirac equation can, how- 
ever, be accelerated without compromising the precision of the final results [35,20]. 
On current PC processors, where the relevant single-precision programs arc signif- 
icantly faster than their double-precision versions, an acceleration by a factor 2 or 
so can be achieved in this way. 
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Appendix E. Calculation of autocorrelation times 

In practice the calculation of autocorrelation times tends to be somewhat ambiguous 
unless the available time series arc very long. The particular prescriptions that were 
used in this paper are described in the following for the case of a primary observable 
A such as the average plaquette. Essentially the discussion follows refs. [33,34] where 
further details and alternative strategies can be found. 

E.l Preliminaries 

Let ai, a2, . . . , ajv be a time series of measurements of A obtained in the course of a 
numerical simulation. It is helpful to imagine that infinitely many such simulations 
were made, with independent random numbers so that the different runs may be 
assumed to be uncorrelated. The true expectation value of A is then 



where (. . .) denotes the average over the infinite set of uncorrelated simulations. 
This value should be distinguished from the average 



which is obtained in a particular simulation. From the point of view of the set of 
simulations, a is a stochastic variable with expectation value (a) = a. 
In terms of the true autocorrelation function 



a = {ai), 



(E.l) 




(E.2) 



i=l 



r{t) = r{-t) = {{ai - a){ai+t - a)) , 



(E.3) 



the statistical variance of the measured value a of ^ is given by 




(E.4) 



At large N this equation may be written in the form 



^2 2 
<7 = ^rint<7o 



(E.5) 
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where ctq = T{0)/N is the naive variance and 



1 ^ r(t) 

the integrated autocorrelation time. 

E.2 Madras-Sokal approximation 
The expression 

^ N-t 
i=l 

provides an approximation to the true autocorrelation function, which can be com- 
puted from the available time series. Its expectation value coincides with T(t) up to 
terms that are negligible (at large A'') with respect to the deviation 



sm = m - im) , (e.s) 

whose variance determines the statistical error of T{t). 

The covariance of ST[t) involves a sum over the four-point autocorrelation func- 
tion, which is usually difficult to compute precisely from the available data alone. 
However, as first noted by Madras and Sokal [34], the disconnected parts of the 
four-point function make the dominant contribution to the sum, and if only these 
are kept, the covariance becomes 

^ oo 

{dT{t)6r{s)) ^ ^ E {^(k)r{k + t-s) + rik + t)T{k - s)}, (e.9) 

fc= — oo 

where t,s <^ N was assTimed and any subleading terms were neglected. It is then 
also possible to derive the elegant formula 

^ oo 

{Sp{tf) ^ - Y,{pik + t)+ p{k -t)- 2p{k)p{t)y (E.IO) 
k=i 

for the variance of the normalized autocorrelation function p{t) = r(t)/r(0). 
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E. 3 Practical procedure 



Equations (E.6) and (E.IO) can be evaluated by substituting pit) = T{t)/V{Qi) for 
the normalized autocorrelation function and truncating the sums at some sufficiently 
large time lag. In the case of the variance of the autocorrelation function, 



t+A 

(M0'> ^ ^ E{^"(^ + ^) + ~p^^ - - '^Pimt)}\ (E.ll) 



fe=i 



the choice of the cutoff A is not critical and values of A > 100 gave consistent results 
for all error bands plotted in fig. 6. 

The integrated autocorrelation time may then be determined through the sum 



1 ^ 

riut = ^ + Y,p{t), (E.12) 
^ t=i 

in which the summation window W is set to the first time lag t where 

p{t)<{5p{tff'\ (E.13) 

Other criteria [33,34] could be applied at this point and may be preferable when the 
statistics allows for an optimization of the summation window. The Madras-Sokal 
formula 

(irL) - '^rl., (E.14) 

may finally be used to estimate the statistical error of the autocorrelation time. 
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